This dataset is based on a data file retrieved from Kaggle - Massive Yahoo Finance Data. The collected data covers 5 years of daily share prices (low, high, open and close values), trade volumes and dividends for the top 500 companies. Because analyzing shares by a broader category such as market capitalization gives more meaningful insight into performance, an additional category called Market-Cap and the monthly percentage returns are derived and added to this data set.
The goal of this analysis is to examine the overall market trend, both for the market as a whole and by broader categories such as Market-Cap, dividends paid by companies and the volatility of daily share prices.
The following box plot depicts the overall market trend across the top 500 companies over the last 5 years. The monthly return for a share is calculated as the percentage change from its Open value on the first trading day of the month to its Close value on the last trading day of the month.
library(plotly)
library(tidyverse)
library(lubridate)
stock_details_5_years <- read.csv("./stock_details_5_years.csv")
stock_details_5_years_tbl <- as_tibble(stock_details_5_years) |>
mutate(Date = as.Date(str_sub(Date, end = 10)))
stock_details_5_years_tbl
## # A tibble: 602,962 × 9
## Date Open High Low Close Volume Dividends Stock.Splits Company
## <date> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr>
## 1 2018-11-29 43.8 43.9 42.6 43.1 167080000 0 0 AAPL
## 2 2018-11-29 105. 106. 104. 105. 28123200 0 0 MSFT
## 3 2018-11-29 54.2 55.0 54.1 54.7 31004000 0 0 GOOGL
## 4 2018-11-29 83.7 84.5 82.6 83.7 132264000 0 0 AMZN
## 5 2018-11-29 39.7 40.1 38.7 39.0 54917200 0.04 0 NVDA
## 6 2018-11-29 136. 140. 136. 139. 24238700 0 0 META
## 7 2018-11-29 23.1 23.2 22.6 22.7 46210500 0 0 TSLA
## 8 2018-11-29 106. 109. 106. 108. 4688300 0 0 LLY
## 9 2018-11-29 136. 136. 134. 134. 8751500 0 0 V
## 10 2018-11-29 33.5 33.9 33.5 33.5 7056600 0 0 TSM
## # ℹ 602,952 more rows
monthly_returns_5_years <-
stock_details_5_years_tbl |> arrange(Date, Company) |>
group_by(yr = year(Date), mon = month(Date), Company) |>
summarise(monthlyReturnsPct = (Close[which.max(date(Date))] - Open[which.min(date(Date))])*100/Open[which.min(date(Date))]) |>
mutate(Month = as.Date(paste(yr, mon, 1, sep = "/"))) |>
ungroup() |> select(Month, Company, monthlyReturnsPct )
plot_ly(monthly_returns_5_years, x=~Month, y=~monthlyReturnsPct, type = "box",
marker = list(color = "black")) |>
layout(xaxis = list(title = "Month", tickvals = ~Month, ticktext = ~Month),
yaxis = list(title = "Monthly Returns %"))
The plot above gives an overall picture of the market over the last 5 years. Each monthly box shows whether the market was up or down in that month, based on where the interquartile range (IQR) sits relative to zero on the y axis.
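The reading of the box plot described above can also be made explicit in numbers. The following is a minimal sketch, not the report's own method: it uses made-up toy returns, and the labelling rule (whole IQR above zero is "up", whole IQR below zero is "down", otherwise "mixed") is an assumption introduced here for illustration.

```r
# Toy monthly returns in percent (hypothetical values, not taken from the dataset)
toy_returns <- data.frame(
  Month = as.Date(rep(c("2020-02-01", "2020-03-01", "2020-04-01"), each = 5)),
  monthlyReturnsPct = c( -2,   1,   3,   4,  6,
                        -30, -20, -15, -10, -5,
                         -8,  -3,   1,   4,  9)
)

# Quartiles of the returns for each month (rows = months, columns = Q1/median/Q3)
q <- t(sapply(split(toy_returns$monthlyReturnsPct, toy_returns$Month),
              quantile, probs = c(0.25, 0.5, 0.75)))

month_summary <- data.frame(
  Month  = as.Date(rownames(q)),
  Q1     = unname(q[, 1]),
  Median = unname(q[, 2]),
  Q3     = unname(q[, 3])
)

# Assumed rule: label a month by where its IQR sits relative to zero
month_summary$Trend <- with(
  month_summary,
  ifelse(Q1 > 0, "up", ifelse(Q3 < 0, "down", "mixed"))
)

print(month_summary)
```

Replacing `toy_returns` with `monthly_returns_5_years` should give one labelled row per month of the real data, since both have the `Month` and `monthlyReturnsPct` columns.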
Companies are categorized by market capitalization, derived from the latest available data.
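The code that builds this category is not shown here. As a rough sketch of how such a grouping could be done, the example below bins hypothetical market capitalizations with `cut()`. The company names, the `MarketCapBn` column, the category labels and the thresholds are all assumptions chosen for illustration; they are not the cut-offs used in this analysis.

```r
# Hypothetical market capitalizations in billions of USD (illustrative only)
market_caps <- data.frame(
  Company     = c("AAA", "BBB", "CCC", "DDD"),
  MarketCapBn = c(2500, 150, 8, 1.5)
)

# Assumed thresholds; the report does not state the ones it uses.
# right = FALSE makes each bin include its lower bound: [0,2), [2,10), ...
market_caps$MarketCap <- cut(
  market_caps$MarketCapBn,
  breaks = c(0, 2, 10, 200, Inf),
  labels = c("Small-Cap", "Mid-Cap", "Large-Cap", "Mega-Cap"),
  right  = FALSE
)

print(market_caps)
```

The resulting factor can then be joined onto the price data by `Company`, so that returns can be summarized per category.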
According to the central limit theorem, the sampling distribution of the mean approaches a normal distribution as the sample size grows, provided the population has finite variance. The plots below show the distribution of the mean percentage swing in daily share prices for various sample sizes to illustrate this theorem.
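The code in this section reads from a `DailyPriceSwing` data frame with a `VariationPercent` column, whose construction is not shown. One plausible definition, offered purely as an assumption, is the intraday high-low range expressed as a percentage of the low. The sketch below applies it to three rows copied from the tibble preview above, using the rounded values as displayed.

```r
# Three rows from the tibble preview (2018-11-29), values as displayed
prices <- data.frame(
  Company = c("AAPL", "MSFT", "GOOGL"),
  High    = c(43.9, 106, 55.0),
  Low     = c(42.6, 104, 54.1)
)

# Assumed definition (not confirmed by the report):
# daily swing = (High - Low) / Low, expressed in percent
DailyPriceSwing <- transform(
  prices,
  VariationPercent = (High - Low) / Low * 100
)

print(round(DailyPriceSwing$VariationPercent, 2))
```

Under this assumed definition the three rows give swings of a few percent, which is at least the same order of magnitude as the sample means reported below.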
options(digits=2)
getMeansOfSamples <- function(x, sampleSizes, samples) {
set.seed(4735)
xbars <- matrix(rep(0, samples * length(sampleSizes)),nrow = length(sampleSizes), byrow = TRUE)
for(i in 1:length(sampleSizes)){
for(j in 1:samples) {
xbars[i,j] <- mean(sample(x, sampleSizes[i], replace = FALSE))
}
}
xbars
}
samples = 1000
sampleSizes = c(10, 20, 30, 40)
# Daily Share price swing percentage for random stock in five years
xbars <- getMeansOfSamples(DailyPriceSwing$VariationPercent, sampleSizes, samples)
fig <- list()
for(i in 1:length(sampleSizes)) {
print(paste("Mean:", round(mean(xbars[i,]), 2), "SD:", round(sd(xbars[i,]), 2)))
fig[[i]]<- plot_ly(x = ~xbars[i,], type = "histogram",
name = paste("Samplesize",
sampleSizes[i], sep = ":" )) |>
layout(xaxis = list(range = c(1,5), title = "Daily Share price variation Percentage"),
yaxis = list(range = c(0,100), title = "Frequency"))
}
## [1] "Mean: 2.52 SD: 0.58"
## [1] "Mean: 2.57 SD: 0.45"
## [1] "Mean: 2.56 SD: 0.35"
## [1] "Mean: 2.55 SD: 0.3"
The plots above show that the distribution of the sample means of the daily price swing is skewed to the right, similar to the original distribution.
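The standard deviations printed above can be checked against another consequence of the theorem: the standard deviation of the sample mean should shrink roughly as sigma / sqrt(n). The short sketch below uses only the four values printed by the loop and rescales each by sqrt(n). If the relationship holds, the products should be roughly constant and approximate the population standard deviation.

```r
# Sample sizes and the SDs of the sample means, copied from the output above
sample_sizes <- c(10, 20, 30, 40)
sd_of_means  <- c(0.58, 0.45, 0.35, 0.30)

# If SD(xbar) is about sigma / sqrt(n), then SD(xbar) * sqrt(n) is about sigma
implied_sigma <- sd_of_means * sqrt(sample_sizes)

print(round(implied_sigma, 2))
```

The four products land between about 1.8 and 2.0, so the shrinking spread is consistent with the 1 / sqrt(n) scaling even though the individual histograms remain right-skewed at these sample sizes.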
Three sampling methods are used to repeat the analysis above: simple random sampling without replacement, systematic sampling with unequal probabilities, and stratified sampling with equal-sized strata.
options(digits=2)
library(sampling)
# Various sampling methods
#Simple random Sampling without replacement
sampleSize <- 50
populationSize <- length(DailyPriceSwing$VariationPercent)
set.seed(4735)
s <- srswor(sampleSize, populationSize)
rows <- which(s == 1) # indices of the units selected by srswor()
par(mfrow = c(2,2))
plot_ly(y =~prop.table(table(DailyPriceSwing$VariationPercent[rows])), type = "bar",
name = "Simple Random Sampling without replacement") |>
layout(title = "Simple Random Sampling without replacement",
xaxis = list(range = c(0,15), title = "Daily Price Swing %"),
yaxis = list(range = c(0,0.4), title = "Proportion"))
# Systematic sampling
library(sampling)
pik <- inclusionprobabilities(DailyPriceSwing$VariationPercent, sampleSize)
s <- UPsystematic(pik)
sytematicSamples <- DailyPriceSwing$VariationPercent[s!=0]
plot_ly(y =~prop.table(table(sytematicSamples)), type = "bar",
name = "Systematic Sampling un-equal Probabilities") |>
layout(title = "Systematic Sampling un-equal Probabilities",
xaxis = list(range = c(0,15), title = "Daily Price Swing %"),
yaxis = list(range = c(0,0.4), title = "Proportion"))
# Stratified, equal sized strata
sts <- sampling::strata(DailyPriceSwing, stratanames = c("Company"), size = rep(3, 500),
method = "srswor", description = FALSE)
stSamples <- sampling::getdata(DailyPriceSwing, sts)
plot_ly(y =~prop.table(table(stSamples$VariationPercent)), type = "bar",
name = "Stratified Sampling") |>
layout(title = "Stratified Sampling",
xaxis = list(range = c(0,15), title = "Daily Price Swing %"),
yaxis = list(range = c(0,0.4), title = "Proportion"))
From the plots above, it is clear that the three samples drawn from the original data are also skewed to the right, similar to the original distribution.
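Right skew is judged visually in the plots above. It can also be quantified with a moment-based skewness coefficient, where a positive value indicates a longer right tail. The sketch below is a base R illustration only: `sample_skewness()` is a helper defined here, not a function from the `sampling` package, and `toy_swing` is synthetic exponential data standing in for a sample of `VariationPercent`.

```r
# Moment-based sample skewness: third central moment / (second central moment)^1.5
# (hypothetical helper written for this sketch)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Synthetic right-skewed stand-in for a sample of daily swing percentages
set.seed(4735)
toy_swing <- rexp(500, rate = 1 / 2.5)

print(sample_skewness(toy_swing))  # positive for a right-skewed sample
print(sample_skewness(1:5))        # zero for a symmetric sample
```

Applying the same helper to `DailyPriceSwing$VariationPercent[rows]`, `sytematicSamples` and `stSamples$VariationPercent` would give one number per sampling method to compare against the full data.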
From the above analysis of the last 5 years of share market data for the top 500 companies, the following observations are made.